Identification of outlier SNPs & genomic offset predictions
Published
October 15, 2024
1 Introduction
Document based on Gain and François (2021) and the LEA tutorial provided during the summer school Software and Statistical Methods for Population Genetics (SSMPG 2022; Aussois, September 19-23 2022).
The latent factor mixed model (LFMM) is a multivariate mixed regression model that estimates simultaneously the effects of environmental variables (fixed effects) and unobserved confounders called latent factors (Frichot et al. 2013; Caye et al. 2019). The latent factors are computed both from the genomes and from their environment. They are not representing neutral population structure (i.e. they have less direct interpretations than in ancestry estimation methods). Instead, they can be interpreted as the best estimates of the confounding effects of neutral population structure, leading to environmental effect size estimates with minimal bias.
2 Population structure
Code not included in the final version of the paper
Indeed, the analyses in this section are only useful if we want to estimate the population structure and impute missing data with the LEA package. In our study, we use the population structure estimates (the ancestry coefficients) from Jaramillo-Correa et al. (2015), and we impute the missing data based on the most common allele in the gene pool. And in the RDA, we do not use the ancestry coefficients to account for population structure; we use PC scores.
From the LEA package manual: The function snmf of the lfmm2 package estimates admixture coefficients using sparse Non-Negative Matrix Factorization algorithms, and provides STRUCTURE-like outputs. The input file has to be in the geno format, with one row for each SNP. Each row contains 1 character for each individual: 0 means zero copy of the reference allele. 1 means one copy of the reference allele. 2 means two copies of the reference allele. 9 means missing data.
Assuming \(n\) diploid organisms genotyped at \(L\) loci, the snmf algorithm decomposes the \(n × L\) matrix of observed allele frequencies, \(P\), in a product of two probabilistic matrices:
\[P \approx QF\]
where the coefficients of \(P\) take their values in {0,1/2,1} for diploid organisms (note: ploidy can be modified in snmf). The matrix \(Q\) is similar to the \(Q\) matrix of STRUCTURE, representing ancestry coefficients for individuals originating from \(K\) source populations (Pritchard, Stephens, and Donnelly 2000), and is a \(n × K\) matrix. The matrix \(F\) contains allele frequencies at each locus for each source population and is a \(K × 3L\) matrix. While \(P\) may contain some missing values, the product matrix, \(QF\), is always a complete probabilistic matrix. \(P\) is a \(n × 3L\) matrix.
Comment: The matrix \(F\) is called the matrix \(G\) in the LEA package and in Frichot et al. (2013).
Code
proj_snmf <-snmf(here("data/LEAanalysis/AlleleCounts_GenoFormat.geno"), K=1:10, repetitions =10, # nb of repetitions for each Kentropy =TRUE,project="new")
We plot the cross-entropy criterion of all run of the project.
Code
# We save the figure for the SIpdf(width =5, height =4, here("figs/LFMM/CrossEntropyPlot.pdf"))par(mar =c(4.5,4,1,1))plot(proj_snmf, cex =1.2, col ="lightblue", pch =19)dev.off()
png
2
Code
plot(proj_snmf, cex =1.2, col ="lightblue", pch =19)
We get the cross-entropy of the 10 runs for K = 6 and we select the run with the lowest cross entropy
Code
ce <-cross.entropy(proj_snmf, K =6)best <-which.min(ce)
We display the Q-matrix.
Code
gp_colors <-c("orangered3","gold2","darkorchid3","navyblue","turquoise2","green3") # define the colors of the gene poolsbp <-barchart(proj_snmf, K =6, run= best,border =NA, space =0, col = gp_colors, xlab ="Individuals", ylab ="Ancestry proportions", main ="Ancestry matrix")axis(1, at =1:length(bp$order), labels = bp$order, las =3, cex.axis = .4)
We use the G function to extract the ancestral genotype frequency matrix, \(G\), for the 2nd run for K = 6.
Code
Gmatrix <-G(proj_snmf, K =6, run = best)Gmatrix[1:5,1:6] %>%as.data.frame() %>%kable_mydf(boldfirstcolumn = F)
V1
V2
V3
V4
V5
V6
1.00
0.44
1.00
1.00
1
1.00
0.00
0.48
0.00
0.00
0
0.00
0.00
0.08
0.00
0.00
0
0.00
0.29
0.00
0.72
0.99
1
0.68
0.53
0.38
0.28
0.01
0
0.28
3 Climatic data
We load the population-specific climatic information for the climatic variables of interest. To run lfmm2, individuals (genotypes in our case) have to be in rows and climatic variables in columns.
The past and future climatic data have been scaled with the parameters (mean and variance) of the past climatic data (which is done by the function generate_clim_datatsets).
Code
# Selected climatic variables# ===========================clim_var <-readRDS(here("data/ClimaticData/NamesSelectedVariables.rds"))# Past and future climatic data# =============================source(here("scripts/functions/generate_scaled_clim_datasets.R"))clim_dfs <-generate_scaled_clim_datasets(clim_var)
We attribute climatic values to each genotype, i.e. genotypes from the same populations will have the same climatic values.
We load the genomic data. The genomic has to be allele counts without missing data, with individuals (genotypes) in rows and SNPs in columns.
Code
# we load the imputed genomic data with allele counts (and without MAF)geno <-read.csv(here("data/DryadRepo/ImputedGenomicData_AlleleCounts_withoutmaf.csv")) %>%column_to_rownames("snp_ID") %>%t() %>%as_tibble()
4.2 Run lfmm2
From Caye et al. (2019): LFMMs are regression models combining fixed and latent effects as follows:
\(\mathbf{Y}\) is the response matrix, which records data for \(n\) individuals genotyped for \(p\) genetic markers. \(\mathbf{X}\) is the matrix of the environmental or primary variables. Nuisance variables such as observed confounders can be included in the \(\mathbf{X}\) matrix, which dimension is then \(n \times d\), where \(d\) represents the total number of primary and nuisance variables. The fixed effect sizes are recorded in the \(\mathbf{B}\) matrix, which has dimension \(p \times d\). The \(\mathbf{E}\) matrix represents residual errors, and it has the same dimensions as the response matrix. The matrix \(\mathbf{W}\) is a “latent matrix” of rank \(K\),defined by \(K\) latent factors. The \(K\) latent factors represent unobserved confounders which are modeled through an \(n \times K\) matrix, \(\mathbf{U}\).The matrix \(\mathbf{U}\) is obtained from a singular value decomposition (SVD) of the matrix \(\mathbf{W}\) as follows
\[\mathbf{W} = \mathbf{UV}^T \]
where \(\mathbf{V}\) is a \(p \times K\) matrix of loadings. The \(\mathbf{U}\) and \(\mathbf{V}\) matrices are unique up to arbitrary signs for the factors and loadings.
As there are 6 gene pools in maritime pine (all represented in our population sample), we run the LFMM model with K=6.
Code
mod_lfmm2 <-lfmm2(input = geno,env = clim_ref, K =6)
The function lfmm2 returns an object of class lfmm2Class that contains the \(\mathbf{U}\) and \(\mathbf{V}\) matrices.
4.3 Calibration issues
With the function lfmm2.test, we can get a vector of p-values for association between loci and climatic variables adjusted for latent factors computed by lfmm2.
The full option:
If FALSE, the function lfmm2.test computes significance values (p-values) from standard Student tests for each climatic variable.
If TRUE, the function lfmm2.test returns p-values for the full set of climatic variables (a single value at each locus) using Fisher tests.
The genomic.control option:
If TRUE (default option), the p-values are recalibrated by using genomic control after correction for confounding.
If FALSE, the p-values are not recalibrated.
We can check if the p-values are well calibrated or not with the histograms of the p-values: ideally, they should be flat with a peak close to zero. in the two graphs below, we show the distribution of the non-calibrated (left graph) and calibrated (right graph) p-values. We can see that it is important to set the genomic.control to its default value TRUE if we want the p-values to be well calibrated.
Code
par(mfrow=c(1,2))# Histogram of non-calibrated p-values# ------------------------------------lfmm2.test(object = mod_lfmm2, input = geno, env = clim_ref, full =TRUE, genomic.control =FALSE)$pvalues %>%hist(col ="orange", main="Histogram of non-calibrated p-values",xlab="p-values")# Histogram of calibrated p-values# --------------------------------lfmm2.test(object = mod_lfmm2, input = geno, env = clim_ref, full =TRUE, genomic.control =TRUE)$pvalues %>%hist(col ="orange", main="Histogram of calibrated p-values",xlab="p-values")
In the following analyses, we use the default genomic.control=TRUE and full=TRUE, so all climatic variables are used in the test.
Code
test_lfmm2 <-lfmm2.test(object = mod_lfmm2,input = geno,env = clim_ref, full =TRUE,genomic.control =TRUE)pv_lfmm2 <- test_lfmm2$pvaluesplot(-log10(pv_lfmm2 ), cex = .3, col ="blue",xlab ="Locus", ylab ="-log10(P)", main="Manhattan plot of log10 p-values")
4.4 Multiple testing and calibration issues
We want to use the FDR control algorithm to correct for multiple testing and determine which loci have significant levels of association. The False Discovery Rate (FDR) is defined as:
FDR = prob(False Discovery | Positive test) = q.
The FDR algorithm requires that the tests are correctly calibrated, i.e. that the distribution of p-values is uniform when we assume that the null hypothesis, \(H0\), is correct. That’s ok in our case, we have already checked it above with the histogram of p-values.
Which FDR level do we use?
Code
fdr_level <-0.05
To identify the candidate SNPs, we apply the chosen FDR control to the p-values, which converts them into q-values. And then we identify candidates as those with q-values below a given FDR threshold.
The candidate SNPs at the FDR level of 5% are shown with circles on the Manhattan plot below. The orange line corresponds to the Bonferroni threshold for a type I error of 10%.
Code
# applying FDR control method to obtain q-valuesqv_lfmm2 <- qvalue::qvalue(pv_lfmm2, fdr.level = fdr_level)# Manhattan plotplot(-log10(pv_lfmm2 ), cex = .3, col ="blue",xlab ="Locus", ylab ="-log10(P)", main="Manhattan plot of log10 p-values")# Show with an orange line the Bonferonni multiple testing threshold for significanceabline(h =-log10(0.1/ncol(geno)), col ="orange")# Extract the list of candidatescandidates <-which(qv_lfmm2$significant)# Show with circles the list of candidate loci at the chosen FDR levelpoints(candidates, -log10(pv_lfmm2)[candidates], cex = .9, col ="brown")
Code
# We save the Manhattan plot for the Supplementary Information# ============================================================pdf(width =8, height =5.5, here("figs/LFMM/SI_ManhattanPlot.pdf"))par(mfrow=c(1,1))plot(-log10(pv_lfmm2 ), cex = .3, col ="gray",xlab ="Locus", ylab ="-log10(p-values)", main="Manhattan plot")abline(h =-log10(0.1/ncol(geno)), col ="orange")points(candidates, -log10(pv_lfmm2)[candidates], cex =1, col ="brown")dev.off()
Code
# We save the list of candidate SNPs for further analyses# =======================================================candidates <-names(geno)[candidates]saveRDS(candidates, here("outputs/LFMM/candidates.rds"))
Until now, methods to estimate the genomic offset has defined it as a distance in the genetic space (i.e. distance between allele frequencies). In the R packae LEA, the genomic offset is alternatively defined as a distance in the environmental space, i.e. measures of genomic offset are linked to the geometry of the ecological niche. This new definition of the genomic offset is referred as genetic gap and is implemented in the genetic.gap function of the LEA package.
The genetic gap is based on the estimates of environmental effect sizes obtained from an LFMM. The relationship inferred in the GEA is then used to fit and predict allelic variation at all genomic loci, alleviating the need for a set of candidate loci and the choice of a significance level.
5.1 Inputs
5.1.1 SNP sets
Below, we will estimate the genetic gap for:
some sets of candidate SNPs (see report 8_GeneratingSNPsets.qmd).
some sets control SNPs (see report 8_GeneratingSNPsets.qmd).
all SNPs (as advised by the developers of the genetic.gap function)
We load the predicted values of the climatic variables for the years 2041-2070 that have been scaled (i.e. mean-center) with the same scaling parameters as the ones used to mean-center the annual climatic values across the period 1901-1950.
Code
# Attribute climatic values for each genotypelist_clim_pred <-lapply(clim_dfs[[2]], function(clim_pred){genotypes %>%mutate(pop =str_sub(clon,1,3)) %>%left_join(clim_pred, by="pop") %>% dplyr::select(-pop,-clon,-gcm)})
The outputs of the genetic.gap function are (copy-and-pasted from the LEA manual):
offset: a vector of genomic offset values computed for every sample location in new.env and pred.env. Note that the genomic offset is referred to as the genetic gap in Gain and François (2021) and the geometric GO in Gain et al. (2023).
distance: a vector of environmental distance values computed for every sample location in new.env and pred.env. The distances to an estimate of the risk of nonadaptedness that includes correction for confounding factors and analyzes multiple predictors simultaneously (modified version of RONA).
eigenvalues: eigenvalues of the covariance matrix of LFMM effect sizes. They represent the relative importance of combinations of environmental variables described in vectors when the environmental data have similar scales. To be used with scale == TRUE.
vectors: eigenvectors of the covariance matrix of LFMM effect sizes representing combinations of environmental variables sorted by importance (eigenvalues).
5.2.1 Relationship with Euclidean climatic distance
In the supplementary information of Gain et al. (2023), the authors say that ‘If (and only if) the eigenvalues of the covariance matrix are equal, the geometric GO is proportional to the squared Euclidean distance between environmental predictors.’ Note that a variable \(y\)is proportional to a variable \(x\) if there is a non-zero constant \(k\) such as \(y = kx\).
In this part, we look at the relationship between the squared root of the genetic gap vs the Euclidean climatic distance (or the genetic gap and the squared Euclidean climatic distance).
The Euclidean climatic distance is calculated as follows:
with \(N\) the number of selected climatic variables, \(X_{past,i}\) the mean value of the climatic variable \(i\) across the period 1901-1950 and \(X_{fut,i}\) the predicted mean value of the climatic variable \(i\) across the period 2041-2070.
# Incorporating gene pool information# -----------------------------------# In the graphs below, we want to color the populations according to the main gene pool they belong to# we load the main gene pool information for each clonegps <-readRDS(here("data/GenomicData/MainGenePoolInformation.rds"))[[1]] %>%arrange(pop)
Code
# Function to build the plots source(here("scripts/functions/make_eucli_plot.R"))
Comment: The squared Euclidean climatic distance and the genetic gap are somewhat correlated, so it means that the genetic gap \(GO\) can be expressed as follows: \(GO = a \times dist^2 + b + e\) with \(dist\) being the Euclidean climatic distance and the \(e\) the residuals (ie the noise). Note that \(GO\) and \(dist^2\) are correlated but not proportional to!
5.2.2 Eigenvalues
Let’s look the the barplots of the eigenvalues.
Comment: The eigenvalues are properties of the model fit, and so only depends on the genomic data and the climate across the reference period. In other words, they do not vary across the different GCMs (ie different future climatic data). That’s why we generate only three barplots, one for each set of SNPs.
Code
par(mfrow=c(1,3))lapply(snp_sets, function(snp_set){ggap <-genetic.gap(input = geno,env = clim_ref,pred.env = list_clim_pred[[1]], # can be any element of the list, it does not mattercandidate.loci =which(names(geno) %in% snp_set$set_snps),K =6)barplot(ggap$eigenvalues, col ="orange", xlab ="Axes", ylab ="Eigenvalues",main=snp_set$set_name)})
We see that mainly one dimension of the climatic space influence the genetic gap. We can also see it with the loadings of the variables (vectors in the outputs of the genetic.gap function), which indicate the relative contribution of the variables to local adaptation. We see that the first variable had increased importance compared to the others (the variables are sorted by importance in the vectors outputs, so we do not know which variables they are).
5.2.3 Comparing GO predictions
We look at the correlation across the different genomic offset predictions at the location of the populations.
# Generate the maps for each set of SNPs and each GCM# ===================================================# Population coordinatespop_coord <-readRDS(here(paste0("data/ClimaticData/MaritimePinePops/ClimatePopulationLocationPointEstimates_ReferencePeriods_noADJ.rds")))[[1]]$ref_means %>% dplyr::select(pop,longitude,latitude)lapply(snp_sets, function(x) {# Find minimum and maximum values of genomic offset for the mapsgo_limits <-lapply(names(list_clim_pred), function(gcm){x$go[[gcm]]}) %>%unlist() %>%range()# The minimum GO value is very very small, almost zero, so we fix it to zero.go_limits[[1]] <-0go_maps <-lapply(names(list_clim_pred), function(gcm){df <- pop_coord %>%mutate(GO = x$go[[gcm]])make_go_map(df = df,plot_title = gcm,go_limits = go_limits,legend_box_background ="white",point_size =3)})legend_maps <-get_legend(go_maps[[1]])go_maps <-lapply(go_maps, function(y) y +theme(legend.position ="none"))go_maps$legend_maps <- legend_mapsgo_maps <-plot_grid(plotlist=go_maps)ggsave(here(paste0("figs/LFMM/GOMaps_PopLocations_",x$set_code,".pdf")), go_maps, width=10,height=6, device="pdf")ggsave(here(paste0("figs/LFMM/GOMaps_PopLocations_",x$set_code,".png")), go_maps, width=10,height=6)# =========# Add title# =========title <-ggdraw() +draw_label( x$set_name,fontface ='bold',x =0,hjust =0 ) +theme(plot.margin =margin(0, 0, 0, 7))# merge title and plotsplot_grid( title, go_maps,ncol =1,rel_heights =c(0.1, 1)) })
For each GCM, we attribute the value 1 to the top five populations with the highest genomic offset and we attribute the value 0 to the other populations. We then count the number of 1 for each population, which gives the table and map below:
Code
selected_SNP_set <-"common_cand"source(here("scripts/functions/make_high_go_pop_maps.R"))high_go_pops <-make_high_go_pop_maps(pop_coord = pop_coord,list_go = snp_sets[[selected_SNP_set]]$go,ggtitle =paste0("LFMM - ",snp_sets[[selected_SNP_set]]$set_name),nb_id_pop =5) # number of selected populations
Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2 3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.
# Load the climatic data of the NFI plots.nfi_clim <-readRDS(here("data/ClimaticData/NFIplots/NFIclimate.rds"))# Keep only the climatic variables of interest and scale the climatic datasource(here("scripts/functions/generate_scaled_nfi_clim_datasets.R"))nfi_dfs <-generate_scaled_nfi_clim_datasets(clim_var, clim_ref = nfi_clim$clim_ref, clim_pred = nfi_clim$clim_survey)
Code
# Calculate the genomic offset for the NFI plotssnp_sets <-lapply(snp_sets, function(snp_set){ggap <-genetic.gap(input = geno,env = clim_ref,new.env = nfi_dfs$clim_ref %>% dplyr::select(any_of(clim_var)),pred.env = nfi_dfs$clim_pred %>% dplyr::select(any_of(clim_var)),candidate.loci =which(names(geno) %in% snp_set$set_snps),K =6)snp_set$go_nfi <- ggap$offsetreturn(snp_set)})# map genomic offset predictions in the NFI plots lapply(snp_sets, function(x) {df <-readRDS(here("data/ClimaticData/NFIplots/NFIclimate.rds"))[[1]] %>%dplyr::select(contains("ude")) %>%mutate(GO = x$go_nfi) p <-make_go_map(df = df,type="NFI",point_size =0.5,go_limits =c(0,max(x$go_nfi)),legend_position =c(0.85,0.2),legend_box_background ="gray",y_limits =c(35, 51),plot_title = x$set_name)ggsave(here(paste0("figs/LFMM/NFI_GOmap_",x$set_code,".pdf")), p, width=6,height=6, device="pdf")ggsave(here(paste0("figs/LFMM/NFI_GOmap_",x$set_code,".png")), p, width=6,height=6)# p <- p + theme(plot.title = element_blank()) # to remove the title# Show maps in the Quarto document# ================================p })
We look at the correlation across the different genomic offset predictions in the NFI plots, i.e. those based on all SNPs and those based on sets of candidates or control SNPs.
# Common garden information# =========================cg_clim <-readRDS(here("data/ClimaticData/CommonGardens/ClimateCG.rds")) %>% dplyr::select(cg,any_of(clim_var))cg_coord <-readRDS(here("data/ClimaticData/CommonGardens/ClimateCG.rds")) %>% dplyr::select(cg,contains("ude"))cg_names <-unique(cg_coord$cg)# Climatic datasets for the predictions# =====================================# In these datasets, one row per prediction# We want GO predictions for each combination pop * CG, so 34 * 5 = 160 rows# climatic data of the populations# each row (ie each population climatic data) is repeated 5 timesclim_pop <-generate_scaled_clim_datasets(clim_var, clim_pred = cg_clim)[[1]] %>%slice(rep(1:n(), each =nrow(cg_clim)))# Climatic data in the common gardens scaled with the mean and variance of the ref-period climatic data# each climatic dataset is repeated 34 times and then rows are combinedclim_cg <-generate_scaled_clim_datasets(clim_var, clim_pred = cg_clim)[[2]] %>%replicate(nrow(pop_coord),.,simplify=F) %>%bind_rows()
Code
# Predict genomic offset of each population when transplanted in the climate of the common gardenssnp_sets <-lapply(snp_sets, function(snp_set){ggap <-genetic.gap(input = geno,env = clim_ref,new.env = clim_pop[,-1],pred.env = clim_cg[,-1],candidate.loci =which(names(geno) %in% snp_set$set_snps),K =6)snp_set$go_cg <-bind_cols(pop=clim_pop[,1],cg=clim_cg[,1]) %>%mutate(go=ggap$offset) %>%pivot_wider(names_from = cg, values_from = go)return(snp_set)})# Mapping# ========go_maps_cg <-lapply(cg_names, function(cg_name){p <-lapply(snp_sets, function(x) {df <- pop_coord %>%left_join(x$go_cg[,c("pop",cg_name)], by="pop") %>% dplyr::rename(GO=all_of(cg_name)) p <-make_go_map(df = df,point_size =3,type="CG",go_limits =c(0,max(x$go_cg[[cg_name]])),cg_coord =filter(cg_coord, cg == cg_name),legend_position =c(0.8,0.25),plot_title =paste0(str_to_title(cg_name), " - ",x$set_name))ggsave(filename =here(paste0("figs/LFMM/GOmap_",x$set_code,"_",cg_name,".pdf")), device ="pdf",width=6,height=6)ggsave(filename =here(paste0("figs/LFMM/GOmap_",x$set_code,"_",cg_name,".png")), width=6,height=6)# p + theme(plot.title = element_blank(), legend.position = "none") # to rm the titlep })plot_grid(plotlist=p,nrow=2)}) %>%setNames(cg_names)pdf(here("figs/LFMM/GOmaps_CGs.pdf"), width=20,height=10)lapply(go_maps_cg, function(x) x)dev.off()# show mapslapply(go_maps_cg, function(x) x)
We look at the correlation across the different genomic offset predictions in the common gardens, i.e. those based on all SNPs and those based on sets of candidates or control SNPs.
Code not included in the final version of the paper
The code below has not been used for the final version of the paper, as we finally did not use the Gradient Forest algorithm with allele frequencies corrected for population structure to identify outlier SNPs.
In a similar way to Gain et al. (2023) and Capblancq et al. (2023), we want to use allele frequencies corrected for population structure in the Gradient Forest analysis. Here we use the outputs from a latent factor mixed model (LFMM) to extract the corrected allele frequencies.
To understand, here the model applied by the lfmm2 function:
\[\mathbf{Y}_{fit} = \mathbf{XB}^T + \mathbf{UV}^T\] where \(\mathbf{B}\), \(\mathbf{U}\) and \(\mathbf{V}\) are the effect size, and factor and loading matrices adjusted by the lfmm2 algorithm from the set of current environmental variables included in the matrix \(\mathbf{X}\). \(\mathbf{B}\) is a matrix of dimension \(p \times b\), with \(p\) the number of genetic markers and \(b\) the number of environmental variables. \(\mathbf{U}\) is a matrix of dimension \(n \times K\), with \(n\) the number of individuals (ie genotypes) and \(K\) the number of latent factors. \(\mathbf{V}\) is a matrix of dimension \(p \times K\). \(\mathbf{X}\) is a matrix of dimension \(n \times b\). \(\mathbf{Y}_{fit}\) is a matrix of dimension \(n \times p\).
We want a matrix of allele frequencies corrected for population structure, so basically:
\[\mathbf{Y}_{corrected} = \mathbf{Y}_{fit} - \mathbf{UV}^T = \mathbf{XB}^T\] Below, we do the matrix multiplication of the matrix \(\mathbf{X}\) (dimension \(n \times b\)) and the transpose of the matrix \(\mathbf{B}\) (dimension \(b \times p\)) to obtain the matrix \(\mathbf{Y}_{corrected}\) (dimension \(n \times p\)).
Code
# we load the imputed genomic data with allele counts (and with MAF)geno_withmaf <-read.csv(here("data/DryadRepo/ImputedGenomicData_AlleleCounts_withmaf.csv")) %>%column_to_rownames("snp_ID") %>%t()# We run LFMM mod_withmaf <-lfmm2(input = geno_withmaf,env =as.matrix(clim_ref),K =6,effect.sizes = T) # to return the matrix B (the matrix of effect sizes)# Matrix multiplication of matrix X and the transpose of matrix Bgeno_corrected <-as.matrix(clim_ref) %*%t(mod_withmaf@B) %>%set_rownames(rownames(geno_withmaf)) %>%as.data.frame() %>%rownames_to_column(var="clon")# save the matrix for future analyses with and without MAFgeno_corrected %>%write_csv(here("data/GenomicData/CorrectedAlleleFrequencies_withmaf.csv"))geno_corrected %>% dplyr::select(clon,all_of(colnames(geno))) %>%write_csv(here("data/GenomicData/CorrectedAlleleFrequencies_withoutmaf.csv"))geno_corrected[1:10,1:10] %>%kable_mydf(boldfirstcolumn =1)
Capblancq, Thibaut, Susanne Lachmuth, Matthew C Fitzpatrick, and Stephen R Keller. 2023. “From Common Gardens to Candidate Genes: Exploring Local Adaptation to Climate in Red Spruce.”New Phytologist 237 (5): 1590–1605. https://onlinelibrary.wiley.com/doi/abs/10.1111/nph.18465.
Caye, Kevin, Basile Jumentier, Johanna Lepeule, and Olivier François. 2019. “LFMM 2: Fast and Accurate Inference of Gene-Environment Associations in Genome-Wide Studies.”Molecular Biology and Evolution 36 (4): 852–60. https://doi.org/10.1093/molbev/msz008.
Frichot, Eric, Sean D Schoville, Guillaume Bouchard, and Olivier François. 2013. “Testing for Associations Between Loci and Environmental Gradients Using Latent Factor Mixed Models.”Molecular Biology and Evolution 30 (7): 1687–99. https://doi.org/10.1093/molbev/mst063.
Gain, Clément, and Olivier François. 2021. “LEA 3: Factor Models in Population Genetics and Ecological Genomics with R.”Molecular Ecology Resources 21 (8): 2738–48.
Gain, Clément, Benedicte Rhone, Philippe Cubry, Israfel Salazar, Florence Forbes, Yves Vigouroux, Flora Jay, and Olivier François. 2023. “A Quantitative Theory for Genomic Offset Statistics.”bioRxiv, 2023–01.
Jaramillo-Correa, Juan-Pablo, Isabel Rodrı́guez-Quilón, Delphine Grivet, Camille Lepoittevin, Federico Sebastiani, Myriam Heuertz, Pauline H Garnier-Géré, et al. 2015. “Molecular Proxies for Climate Maladaptation in a Long-Lived Tree (Pinus Pinaster Aiton, Pinaceae).”Genetics 199 (3): 793–807. http://www.genetics.org/content/199/3/793.
Pritchard, Jonathan K, Matthew Stephens, and Peter Donnelly. 2000. “Inference of Population Structure Using Multilocus Genotype Data.”Genetics 155 (2): 945–59.